In this paper we discuss techniques for speeding up k-medoids clustering. Specifically, we address the advantages of pre-caching the pairwise distance matrix, the core of the k-medoids clustering algorithm, in order to speed up not only the execution of the algorithm itself but also the evaluation of the well-known Silhouette Index and Davies-Bouldin Index for cluster validation. A major disadvantage of such pre-caching is that it might not be suitable for large datasets. To this end, a further contribution consists in proposing parallel and distributed implementations of both the Simplified Silhouette Index and the Davies-Bouldin Index for distributed k-clustering using the Apache Spark framework. Results on real-world pathway map datasets show the robustness of these distributed implementations and underline their effectiveness for structured data.
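To make the pre-caching idea concrete, the following is a minimal single-machine sketch, not the authors' implementation. It assumes Euclidean distances, a plain Voronoi-iteration k-medoids, at least two clusters, and the usual medoid-based definition of the Simplified Silhouette. All function and parameter names (`pairwise_distances`, `k_medoids`, `simplified_silhouette`, `init`) are hypothetical. The distance matrix is computed once and then only read, both by the clustering loop and by the validation index.

```python
import numpy as np


def pairwise_distances(X):
    """Compute the full Euclidean distance matrix once (O(n^2) memory)."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def k_medoids(D, k, init=None, n_iter=100, seed=0):
    """Voronoi-iteration k-medoids that only reads the cached matrix D.

    `init` is an optional array of k initial medoid indices (an assumption
    of this sketch, useful for reproducible runs).
    """
    rng = np.random.default_rng(seed)
    if init is None:
        medoids = rng.choice(D.shape[0], size=k, replace=False)
    else:
        medoids = np.asarray(init).copy()
    for _ in range(n_iter):
        # Assignment step: nearest medoid, looked up in the cached matrix.
        labels = D[:, medoids].argmin(axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size == 0:
                continue  # keep the old medoid for an empty cluster
            # Update step: member with the smallest sum of distances
            # to the other members of the same cluster.
            within = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[within.argmin()]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels


def simplified_silhouette(D, medoids, labels):
    """Mean Simplified Silhouette, using point-to-medoid distances only."""
    n = D.shape[0]
    to_medoids = D[:, medoids].copy()      # shape (n, k), read from the cache
    a = to_medoids[np.arange(n), labels]   # distance to own medoid
    to_medoids[np.arange(n), labels] = np.inf
    b = to_medoids.min(axis=1)             # distance to nearest other medoid
    denom = np.maximum(a, b)
    s = np.divide(b - a, denom, out=np.zeros_like(a), where=denom > 0)
    return float(s.mean())


if __name__ == "__main__":
    X = np.array([[0, 0], [0, 1], [1, 0],
                  [10, 10], [10, 11], [11, 10]], dtype=float)
    D = pairwise_distances(X)              # cached once, reused below
    medoids, labels = k_medoids(D, k=2, init=[0, 5])
    print(medoids, labels, simplified_silhouette(D, medoids, labels))
```

The cached matrix needs memory quadratic in the number of patterns, which is the limitation the abstract points to for large datasets and the motivation for the distributed variants.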

Distance matrix pre-caching and distributed computation of internal validation indices in k-medoids clustering / Martino, Alessio; Rizzi, Antonello; FRATTALE MASCIOLI, Fabio Massimo. - ELECTRONIC. - 2018:(2018), pp. 1-8. (Paper presented at the International Joint Conference on Neural Networks (IJCNN) 2018, held in Rio de Janeiro, Brazil) [10.1109/IJCNN.2018.8489101].

Distance matrix pre-caching and distributed computation of internal validation indices in k-medoids clustering

Alessio Martino (first author);
Antonello Rizzi (second author);
Fabio Massimo Frattale Mascioli (last author)
2018

International Joint Conference on Neural Networks (IJCNN) 2018
data clustering; unsupervised learning; big data mining; large-scale pattern recognition; distributed computing
04 Publication in conference proceedings::04b Conference paper in volume
Files attached to this item

Martino_Distance-Matrix_2018.pdf
Access: archive managers only
Type: Publisher's version (published version with the publisher's layout)
License: All rights reserved
Size: 1.42 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1122005
Citations
  • PMC: n/a
  • Scopus: 22
  • ISI: 0